Problem 1

Classification is the process which finds a model to categorize data into unique groups. Regression models the data into a continuous function between two or more variables. The main difference between classification and regression is how the data are treated. In classification, the data are treated as categorical. In regression, the data are treated as continuous. Both methods use the data to create a model, but the models are different in nature.

Problem 2

Load the data.

adultdata <- read.csv("~/Desktop/adultdata.txt")

2.a

Data descriptions (as reported for the 1990 census):

  • age is the age of an individual; integer.
  • workclass is the work class of an individual; string (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.)
  • fnlwgt is unknwon as there is not enough information to describe the variable; integer.
  • education is the completed education of an individual; string (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)
  • education.num is the cooresponding number to the completed education of an individual; integer.
  • marital.status is the marital status of an individual; string (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.)
  • occupation is the occupation of an individual; string (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.)
  • relationship is the relationship of the individual to a specific household; string (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.)
  • race is the race of an individual; string (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.)
  • sex is the sex of an individual; string (Female, Male.)
  • capital.gain is the capital gain of an individual; integer.
  • capital.loss is the capital loss of an individual; integer.
  • hours.per.week is the number of hours worked per week by an individual; integer.
  • native.country is the native country of an individual; string (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.)
  • income is the income class of an individual; string (<=50K, >50K.)

2.b

adultdata[adultdata == " ?"] <- NA
apply(is.na(adultdata),2,sum)/nrow(adultdata)*100
##            age      workclass         fnlwgt      education  education.num 
##       0.000000       5.638647       0.000000       0.000000       0.000000 
## marital.status     occupation   relationship           race            sex 
##       0.000000       5.660146       0.000000       0.000000       0.000000 
##   capital.gain   capital.loss hours.per.week native.country         income 
##       0.000000       0.000000       0.000000       1.790486       0.000000

From the code above…

  1. workclass has 5.638647% missing values
  2. occupation has 5.660146% missing values.
  3. native.country has 1.790486% missing values.
  4. All other variables have no missing values.

The histogram below shows the distribution of missing values for each row in the data set.

hist(apply(is.na(adultdata),1,sum),
     xlab = "Number of Missing Values Per Row",
     ylab = "Frequency",
     main = "Number of Missing Values",
     breaks = 3,
     col = "blue")

2.c

The following variables are numeric:

  • age
  • fnlwgt
  • education.num
  • captial.gain
  • capital.loss
  • hours.per.week

The following variables are categorical:

  • workclass
  • education
  • marital.status
  • occupation
  • relationship
  • race
  • sex
  • native.country
  • income

Split these variables into partitioned sets for future use.

adultnumeric <- adultdata[,c(1,3,5,11,12,13)]
adultcat <- adultdata[,-c(1,3,5,11,12,13)]

2.d

Create a histogram for each numeric variable with length(x) bins if length(x) < 100, else with 100 bins.

apply(adultnumeric, 2, function(x) ifelse(length(unique(x)) < 100,
              hist(x, 
                   breaks = length(unique(x)),
                   main = "",
                   xlab = ""),
              hist(x, 
                   breaks = 100,
                   main = "",
                   xlab = "")))

## $age
## $age[[1]]
##  [1] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## [24] 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## [47] 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [70] 86 87 88 89 90
## 
## 
## $fnlwgt
## $fnlwgt[[1]]
##  [1]       0   20000   40000   60000   80000  100000  120000  140000
##  [9]  160000  180000  200000  220000  240000  260000  280000  300000
## [17]  320000  340000  360000  380000  400000  420000  440000  460000
## [25]  480000  500000  520000  540000  560000  580000  600000  620000
## [33]  640000  660000  680000  700000  720000  740000  760000  780000
## [41]  800000  820000  840000  860000  880000  900000  920000  940000
## [49]  960000  980000 1000000 1020000 1040000 1060000 1080000 1100000
## [57] 1120000 1140000 1160000 1180000 1200000 1220000 1240000 1260000
## [65] 1280000 1300000 1320000 1340000 1360000 1380000 1400000 1420000
## [73] 1440000 1460000 1480000 1500000
## 
## 
## $education.num
## $education.num[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
## 
## 
## $capital.gain
## $capital.gain[[1]]
##   [1]      0   1000   2000   3000   4000   5000   6000   7000   8000   9000
##  [11]  10000  11000  12000  13000  14000  15000  16000  17000  18000  19000
##  [21]  20000  21000  22000  23000  24000  25000  26000  27000  28000  29000
##  [31]  30000  31000  32000  33000  34000  35000  36000  37000  38000  39000
##  [41]  40000  41000  42000  43000  44000  45000  46000  47000  48000  49000
##  [51]  50000  51000  52000  53000  54000  55000  56000  57000  58000  59000
##  [61]  60000  61000  62000  63000  64000  65000  66000  67000  68000  69000
##  [71]  70000  71000  72000  73000  74000  75000  76000  77000  78000  79000
##  [81]  80000  81000  82000  83000  84000  85000  86000  87000  88000  89000
##  [91]  90000  91000  92000  93000  94000  95000  96000  97000  98000  99000
## [101] 100000
## 
## 
## $capital.loss
## $capital.loss[[1]]
##  [1]    0   50  100  150  200  250  300  350  400  450  500  550  600  650
## [15]  700  750  800  850  900  950 1000 1050 1100 1150 1200 1250 1300 1350
## [29] 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050
## [43] 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750
## [57] 2800 2850 2900 2950 3000 3050 3100 3150 3200 3250 3300 3350 3400 3450
## [71] 3500 3550 3600 3650 3700 3750 3800 3850 3900 3950 4000 4050 4100 4150
## [85] 4200 4250 4300 4350 4400
## 
## 
## $hours.per.week
## $hours.per.week[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## [70] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [93] 93 94 95 96 97 98 99

Create two side-by-side histograms for each numeric variables. One historgram (blue) is the oberservations where income = "<=50K". The other (red) is the observations where income = ">50K".

lesser <- adultdata[which(adultdata$income == " <=50K"), ]
greater <- adultdata[which(adultdata$income == " >50K"), ]

for(i in c(1,3,5,11,12,13)){
par(mfrow=c(1,2))
hist(lesser[,i],
     xlab = names(lesser)[i],
     ylab = "Frequency",
     main = "",
     col = "blue")
hist(greater[,i],
     xlab = names(greater)[i],
     ylab = "Frequency",
     main = "",
     col = "red")
}

Create two side-by-side boxplots. The color coding is the same.

for(i in c(1,3,5,11,12,13)){
  par(mfrow=c(1,2))
  boxplot(lesser[,i],
       xlab = names(lesser)[i],
       ylab = "Frequency",
       main = "",
       col = "blue")
  boxplot(greater[,i],
       xlab = names(greater)[i],
       ylab = "Frequency",
       main = "",
       col = "red")
}

par(mfrow = c(1,1))

Based on the histograms and box plots, the data are similarly distributed regardless of income class. There appears to be a greater number of higher income individuals with higher education. Higher income individuals also have a higher chance of working more hours per week.

2.c

Create a bar plot for each variable representing the frequency of each unique value.

for(i in c(2,4,6,7,8,9,10,14,15)){
barplot(table(adultdata[,i]), 
        las = 2, 
        cex.names = 0.65,
        main = "",
        xlab = names(adultcat)[i],
        col = "blue")
}

Create two side-by-side bar charts for each categorical variables. One bar chart (blue) is the oberservations where income = "<=50K". The other (red) is the observations where income = ">50K".

for(i in c(2,4,6,7,8,9,10,14,15)){
  par(mfrow=c(1,2))
  barplot(table(lesser[,i]), 
          las = 2, 
          cex.names = 0.65,
          main = "",
          xlab = names(lesser)[i],
          col = "blue")
  barplot(table(greater[,i]), 
          las = 2, 
          cex.names = 0.65,
          main = "",
          xlab = names(greater)[i],
          col = "red")
}

par(mfrow = c(1,1))

One thing that the data shows is that more higher income individuals have a masters degree than lower income invididuals. Less higher income individuals are divorsed than lower income individuals.

2.f

plot(y = adultdata$education.num, x = adultdata$age,
     xlab = "Age",
     ylab = "Education",
     main = "Education vs. Age")

cor(adultdata$education.num, adultdata$age)
## [1] 0.03652719

There is almost no relationship between education and age. The correlation is near zero, implying the lack of relationship. The plot shows no intuitive pattern(s) either.

plot(x = adultnumeric$capital.gain, y = adultnumeric$hours.per.week,
     xlab = "Capital Gain",
     ylab = "Hours Per Week",
     main = "Capital Gain vs. Hours Per Week")

cor(adultnumeric$capital.gain, adultnumeric$hours.per.week)
## [1] 0.07840862

There is no relationship between capital gain and hours worked per week. The correlation value between these two variables is almost zero, implying no relationship between the two variables.

Problem 3

3.a

Load the data.

nfl.passing.2014 <- read.csv("~/Google Drive/Academic/Spring 2016/CS 4821/nfl-passing-2014.csv")
summary(nfl.passing.2014$TD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    2.00    8.17   12.75   39.00
names(sort(-table(nfl.passing.2014$TD)))[1] # mode
## [1] "0"

The mean of TD is 8.17, the median is 2.00, and the mode is 0.

summary(nfl.passing.2014$Int)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     2.0     4.5     9.0    18.0
names(sort(-table(nfl.passing.2014$Int)))[1] # mode
## [1] "0"

The mean of Int is 4.5, the median is 2.00, and the mode is 0.

quantile(nfl.passing.2014$Yds)
##      0%     25%     50%     75%    100% 
##    0.00   19.25  274.50 2653.25 4671.00

The lower and upper quantiles for YDS are 19.25 and 2653.25, respectively.

quantile(nfl.passing.2014$Rate)
##     0%    25%    50%    75%   100% 
##  27.90  55.45  84.15  96.40 158.30

The lower and upper quantiles for RATE are 55.45 and 96.40, respectively.

fivenum(nfl.passing.2014$Cmp.)
## [1]   0.0  50.0  60.4  67.0 100.0

The five number summary of CMP. is above.

fivenum(nfl.passing.2014$Yds)
## [1]    0.0   19.0  274.5 2710.0 4671.0

The five number summary of YDS is above.

Create a histogram of YARDS with 4, 8, and 12 bins respectively.

hist(nfl.passing.2014$Yds,
     xlab = "Passing Yards",
     main = "Histogram of Passing Yards",
     col = "blue",
     breaks = 4)

hist(nfl.passing.2014$Yds,
     xlab = "Passing Yards",
     main = "Histogram of Passing Yards",
     col = "blue",
     breaks = 8)

hist(nfl.passing.2014$Yds,
     xlab = "Passing Yards",
     main = "Histogram of Passing Yards",
     col = "blue",
     breaks = 12)

Something strange is happening with these plots. Despite setting each break value to 4, 8, and 12, respectively, they do not appear to be binning correctly.

Compare the distribution of the of the number of YARDS of each quarter back based on whether they threw less than 12 interceptions.

lessthan12 <- nfl.passing.2014[which(nfl.passing.2014$Int < 12), ]
hist(lessthan12$Yds,
     xlab = "Passing Yards",
     main = "Distribution of Passing Yards")

Draw a scatter plot of TD vs. INT.

plot(x = nfl.passing.2014$TD,
     y = nfl.passing.2014$Int,
     xlab = "Passing Touchdowns",
     ylab = "Passing Interceptions",
     main = "Passing Touchdowns vs. Interceptions")

3.b

Load the data.

tennis <- read.csv("~/Google Drive/Academic/Spring 2016/CS 4821/2014w.csv")

List top 5 players who have played a minimum of 20 matches with the highest average number of games won in a match.

twenty <- tennis[tennis$Winner %in% names(subset(table(tennis$Winner), table(tennis$Winner) >= 20)), ]
twenty <- twenty[, c(10, 16,18,20)]
twenty$avg <- rowMeans(twenty[,c(2,3,4)], na.rm = TRUE)
head(twenty[order(-twenty$avg),])
##            Winner W1 W2 W3      avg
## 1421 Pliskova Ka.  6  6 10 7.333333
## 12    Ivanovic A.  7  7 NA 7.000000
## 129     Kerber A.  7  7 NA 7.000000
## 168       Keys M.  6  6  9 7.000000
## 338       Peng S.  7  7 NA 7.000000
## 645     Garcia C.  7  7 NA 7.000000

From the code above, the top 5 players are:

  1. Plaskova Ka.
  2. Ivanovic A.
  3. Kerber A.
  4. Keys M.
  5. Peng S.

3.c

Load the data.

wc.data.2014 <- read.csv("~/Google Drive/Academic/Spring 2016/CS 4821/wc-data-2014.csv")
groupStage <- wc.data.2014[1:48,]
groupStage$home_dif <- groupStage$home_score - groupStage$away_score
groupStage$away_dif <- groupStage$away_score - groupStage$home_score

sort(tapply(groupStage$home_dif, groupStage$home, FUN=sum) + 
       tapply(groupStage$away_dif, groupStage$away, FUN=sum),
     decreasing = TRUE)
##    Columbia Netherlands      France      Brazil     Germany   Argentina 
##           7           7           6           5           5           3 
##     Belgium  Costa Rica      Mexico       Chile     Algeria Switzerland 
##           3           3           3           2           1           1 
##      Bosnia     Croatia     Ecuador     Nigeria     Uruguay         USA 
##           0           0           0           0           0           0 
##       Italy Ivory Coast      Russia     England       Ghana      Greece 
##          -1          -1          -1          -2          -2          -2 
##        Iran       Korea    Portugal       Spain       Japan   Australia 
##          -3          -3          -3          -3          -4          -6 
##    Honduras    Cameroon 
##          -7          -8

From the code above, Columbia and the Netherlands have the best differential.

sort(tapply(groupStage$home_dif, groupStage$home_continent, FUN=sum) + 
       tapply(groupStage$away_dif, groupStage$away_continent, FUN=sum),
     decreasing = TRUE)
## South America        Europe      CONCACAF        Africa          Asia 
##            17            10            -1           -10           -16

From the code above, South America has the best differential.